Pages
OCR and OpenAI Processes
Architecture
Document Intelligence - Classification
Manual classification
File Architecture
Process flow
Decision Tree
Phase
External/Output
Azure Resource
Sample - uncluster
Sample - Label
Schema1
Schema2
Schema...
Schema28
Population - uncluster
Population - cluster
Extraction Model
Unknown
Template 2
Template 3
Template 4
Template 1
Population - Cluster
Schema1
Schema2
Schema...
Schema43
Sample - Label
Schema1
Schema2
Schema...
Schema43
Population - Cluster
Schema1
Schema2
Schema...
Schema43
Population - uncluster
template1
template2
template99
template1
template2
template99
trainingsamples
population
pdf
websiteinfo
json
jsondata
jsonmodified
json
pdf
jsondata
jsonresponses
Storage Account
OpenAI
Document Intelligence Script
SQL Warehouse
QA/QC
Same as PDFs
Each document has a unique document ID from Project ID extracted by OpenAI
Same names as PDFs
Unique document names
Population - cluster
Unknown
Template 2
Template 3
Template 4
Template 1
Extraction Model
websiteinfo
OpenAI Script
review
donottouch
Document Intelligence Script
Folder 1
Folder 2
Folder 99
PDFs with CSV + Prompt
jsondata (source of truth)
jsonmodified (clone)
Copies 1:1
Reads/writes API calls
overwritten
Listener 1:1
Business Intelligence Tools
Frequent access
Raw PDFs
Document Intelligence
creates
access
access
Step 1: Identify a minimum of 10 samples for each schema
Step 2: Train Classification Model
Step 2b: Upload
Step 3: Run Classification Model
Within each model, select required fields
Step 4: Run Extraction Model
Step 1: Manually figure out schemas/formats
Step 2: Train Classification Model
Step 2b: Upload
Step 3: Run Classification Model
OCR
OpenAI
Document type
Azure Blob Storage Account Container
SQL Warehouse
donottouch
jsondata
Classification
Event Grid
Trigger on new PDFs uploaded
1. JSON duplication check
2. Creates JSONs
Delta Migration
jsonresponses
creates
jsonmodified
references
updates
OpenAI (SSC)
References API
...
...
...
...
Within each model, select required fields
Step 4: Run Extraction Model
Want to perform analysis on data.
Is the document digitized?
What is the document format?
Not a supported format. Research phase to convert videos into frames.
Video
Format is supported; still research phase
Image
Is OpenAI integration required?
Text
Access Data in SQL Warehouse
No
Is source data Protected-B or above?
Yes
OpenAI not cleared by IT-SEC to use sensitive data
Yes
Process a large number of documents with OpenAI?
No
OpenAI web-chatbot: uses a web interface to ask individual questions for a given document.
No
Are prompts pre-determined?
OpenAI-Script can group large amounts of documents for a given set of prompts and process all documents at once.
OpenAI WebApp uses a web interface to upload a document with a set of prompts and get OpenAI responses.
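The OpenAI branch of the decision tree can be sketched as a small function. This is only one reading of the diagram: the Yes/No edges are not recoverable from the exported text, so the mapping below is an assumption inferred from the wording of each outcome box. The function name and parameter names are hypothetical.

```python
def recommend_openai_tool(protected_b_or_above: bool,
                          large_volume: bool,
                          prompts_predetermined: bool) -> str:
    """Return the outcome box reached in the OpenAI part of the decision tree.

    Assumed edge mapping (inferred, not stated in the diagram text):
      - Protected-B or above      -> OpenAI is not cleared by IT-SEC
      - not a large volume        -> OpenAI web-chatbot
      - prompts pre-determined    -> OpenAI-Script (batch processing)
      - otherwise                 -> OpenAI WebApp (upload with prompts)
    """
    if protected_b_or_above:
        return "OpenAI not cleared by IT-SEC to use sensitive data"
    if not large_volume:
        return "OpenAI web-chatbot"
    if prompts_predetermined:
        return "OpenAI-Script"
    return "OpenAI WebApp"


if __name__ == "__main__":
    # Many non-sensitive documents with a fixed prompt set.
    print(recommend_openai_tool(False, True, True))
```

Checking the sensitivity question first reflects the diagram's note that sensitive data cannot go to OpenAI at all, whatever the volume or prompt setup.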
FoSx-SP-Waayback-LethalIndigowingedparrot
PSSI-OpenAI-RG
Storage Account (pssidatalake)
Web App (pssi-openAI-prompts)
Document Intelligence (pssi-prebult-models)
Data Factory (pssi-pipelines)
Databricks (pssi-openai-databricks)
Web App (pssi-prd-pstb-rcoe-gc)
Web App (pssi-prd-emb-ispe-planningliterature)
SSC Directory
DFO Directory
SQL Warehouse (emb-ipse)
SQL Warehouse (pstb-rcoe)
Azure OpenAI (pssi-openai)
SC2G - PROD ProB
EDH-PSSI-PROD-RG
Storage Account (stpssiprd)
Document Intelligence (pssi-doc-ai-prd)
Data Factory (adfpsssiprdinnovation)
Web App (pssi-openAI-chatbot)
EDH-PROD-RG
Blob Container (fm-bsc-fishslips)
Blob Container (science-stockassesment-sil)
SQL Warehouse (science-stockassesment)
SQL Warehouse (emb-ffhpp)
SQL Warehouse (rm-dml)
SQL Warehouse (fm-rec)
Note: The newest Document Intelligence API, which outputs a confidence score for extracted table data, is currently not available for region "Central Canada".
Note: OpenAI cannot be deployed in the DFO directory; hence all of the OpenAI-related instances are in SSC.
Missing: Access to Databricks in EDH-PROD-RG with permissions to interact with Document Intelligence, Data Factory, and Storage Account from EDH-PSSI-PROD-RG.
Missing: Instances of SQL Warehouses for each project having data processed in EDH-PSSI-PROD.
SQL Warehouse (qcfm-rec)
Databricks
Web App (pssi-prd-ocr-qcfm-rec)
Web App (pssi-prd-ocr-emb-ffhpp)
Web App (pssi-prd-ocr-fm-rec)
Web App (pssi-prd-ocr-science-stockassesment)
Blob Container (emb-ffhpp-complicance-inspectionreports)
Blob Container (qcfm-rec-seaobserver-logbooks-dockside-purchaseslips)
Blob Container (emb-ipse-planningliterature)
Blob Container (pstb-rcoe-gc)
creates
access
jsondata
creates
references
Document Intelligence Classification & Extraction Process (Custom)
QA/QC Website interface
Digitized
Document Intelligence Extraction (Pre-built)
Yes
Yes
No
Yes
Legend:
- In-house Application
- SAS
Web App (pssi-openai-analyzer-prd-01)
Typed
Clear Formatting and Consistent Layout
Select all fields
Web Page
Field: verified (yes/no)
Databricks – SQL Warehouse
Field: verified (yes/no)
Table 2
Table 1
Table 3
No
No
Yes
No
Yes
Yes
No
Blob Container (bc16-website)
Web App (bc16-website)
Azure OpenAI
Blob Container (pstb-rcoe-gc)
Blob Container (emb-ipse-planningliterature)
Schema
OpenAI – Code
Copies (non-duplicates)
GitHub (code)
Repo (no saved endpoints or keys)
Clone (local)
DFO Azure Webapps
Manual key entries
Classification & Extraction Model
Extraction
WebApps
OCR - Code
Databricks – Notebook (Script)
Document Intelligence
Databricks – Notebook (Script)
OpenAI
Databricks
PDF/folder#
...
Summary:
User uploads files into 'population' as an individual folder.
A Databricks notebook runs the classify scripts.
Files are replicated into 'donottouch'; a filename check prevents duplicates.
Documents in 'donottouch' are re-sorted into folders named after the template assigned by the custom classification model.
Documents with low confidence scores are moved into 'trainingsamples/review'.
A Databricks notebook runs the OCR scripts.
Documents that already have a JSON output are ignored.
All other documents are processed with the custom extraction model, and the JSON outputs are saved in 'jsondata' and 'jsondatamodified'. The script uses the model naming convention to get the latest model.
User opens the OCR web app.
Documents are read from 'jsondatamodified' and visualized.
Edits and verification changes directly overwrite the JSONs in 'jsondatamodified'.
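The classify and OCR steps in the summary can be sketched as a few pure helper functions. This is a minimal illustration, not the actual notebook code. All function names are hypothetical, the 0.8 confidence cut-off is an assumed value (the summary gives none), and the `<template>-v<N>` model naming convention is invented here because the summary does not say what the real convention is. Storage and Document Intelligence calls are left out so the routing logic stands alone.

```python
import re
from pathlib import PurePosixPath

REVIEW_FOLDER = "trainingsamples/review"
CONFIDENCE_THRESHOLD = 0.8  # assumed cut-off; not stated in the summary


def files_to_replicate(population, donottouch):
    """Population paths whose filename is not yet present in 'donottouch'."""
    existing = {PurePosixPath(p).name for p in donottouch}
    return [p for p in population if PurePosixPath(p).name not in existing]


def destination_folder(template, confidence, threshold=CONFIDENCE_THRESHOLD):
    """Folder for a classified document; low confidence goes to review."""
    if confidence < threshold:
        return REVIEW_FOLDER
    return f"donottouch/{template}"


def needs_extraction(pdf_path, json_outputs):
    """True when no JSON output shares the PDF's base name."""
    done = {PurePosixPath(j).stem for j in json_outputs}
    return PurePosixPath(pdf_path).stem not in done


def latest_model(model_ids, template):
    """Pick the highest version among IDs shaped like '<template>-v<N>'.

    The '-v<N>' suffix is a hypothetical naming convention.
    """
    pattern = re.compile(rf"^{re.escape(template)}-v(\d+)$")
    versions = []
    for model_id in model_ids:
        match = pattern.match(model_id)
        if match:
            versions.append((int(match.group(1)), model_id))
    return max(versions)[1] if versions else None


if __name__ == "__main__":
    new_files = files_to_replicate(
        ["population/folder1/a.pdf", "population/folder2/b.pdf"],
        ["donottouch/template1/a.pdf"],
    )
    print(new_files)
    print(destination_folder("template1", 0.42))
    print(latest_model(["template1-v2", "template1-v10"], "template1"))
```

Versions are compared as integers rather than strings, so that 'v10' sorts after 'v2'.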